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Abstract 

We demonstrate how one can choose the smoothing parameter in 
image denoising by a statistical multiresolution criterion, both globally 
and locally. Using inhomogeneous diffusion and total variation regu- 
larization as examples for localized regularization schemes, we present 
an efficient method for locally adaptive image denoising. As expected, 
the smoothing parameter serves as an edge detector in this framework. 
Numerical examples illustrate the usefulness of our approach. We also 
present an application in confocal microscopy. 
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1 Introduction 



Image denoising is one of the main tasks in image analysis, as documented by 



numerous articles and books published on the subject, cf. e.g. (Scherzer et al. 
2009P , ( |Buades et aL]2005fr and ( |Aubert and Kornprobst|2002[ ). Accordingly 



statisticians have contributed their share: using probabilistic models for the 
true image, Bayesian methods were among the first to add a statistical per- 
spective to the subject, see e.g. (Geman and Geman 1984), (Besag 1986), 



and (Winkler 2003). Taking a frequentist's point of view, image denoising 



becomes a smoothing or reconstruction problem, see e.g. (Hall and Titter 



ington 


1986 


) and ( 


Polzehl and Spokoiny 


2000 


and Tsybakov||1993 


) . Acknowledging the non-s 



caused by sharp edges, cf. ( Chu et aT||1998 ) and ( Donoho||1999 ), prompted a 



generalization of the well-established smoothing techniques, e.g. drawing on 
methods from one-dimensional jump detection, see dQlu|2005l|2007| ); cf. also 



(Mrazek et al. 2006) for a unifying framework for many popular numerical 
and statistical denoising techniques. 

Put simply, statistical image denoising amounts to reconstructing a noise- 
less image / given a noisy image y. Usually, one assumes the noise e to be 
additive, i.e. y = f + e. In this article, we assume that the noise is generated 
at random; more specifically, we assume for pixels that 



Uij fij ~l" 



(1) 



with €ij identically and independently distributed Gaussian random variables 
with zero mean and variance a 2 , i.e. 



~ d A/-(0,a 2 



(2) 



in particular we assume the value at each pixel to be a real number, i.e. yij, 
fij, €ij e R. This models a grey-scale image, though in practice its grey levels 
are usually restricted to a finite number of discrete values, e.g. to integers 
between and 255. In many applications, a Gaussian assumption on the 
noise is therefore not very plausible but other noise processes will be more 
suitable. Nonetheless, for simplicity's sake we will lay out our basic ideas 
under a Gaussian assumption; possible extensions beyond this are briefly 
discussed at the end of Section|3j We note, however, that for a well-calibrated 
image y making good use of the range of (not too few) possible values, this 
assumption is not very crucial, as long as the errors are i.i.d., symmetric and 
feature no heavy tails: the true image / will then take values well inside the 
range, itself not necessarily being restricted to discrete values - a property it 
passes onto the errors e^ . 

Figure [T] shows an artificial example where / exhibits varying degrees of 
smoothness (left), and Gaussian white noise has been added to obtain the 
data y (right). The assumption about the noise can then be exploited in 
order to distinguish between the 'true', noiseless image / and the noise e as 
demonstrated in the following sections. 
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(a) 



(b) 



Figure 1: (a) 256 x 256 pixel test image / taking values in [0,5]; dashed line 
indicates row 64 shown in detail in Figure [8] (b) Simulated data y with noise 
level cr = l. 



Clearly, this is impossible unless assumptions are made about /, e.g. 
that it varies slowly from pixel to pixel. In order to be able to formulate such 
'smoothness' assumptions more precisely, we let (in a slight abuse of notation) 
fij denote the values of a function / on a regular grid, i.e. / : [0, l] 2 — > R 
with fij = /(^ -J where i,j G {1, . . . , n}. Note that we use a square grid 
for ease of notation, with Xij = (-, -) in the following; one easily can extend 
all the analyses to rectangular grids, to higher dimensions, or to non-uniform 



sampling schemes through the use of finite elements, if required, cf. e.g. (Ern 



and Guermond|[2004 ) . Now that / can be viewed as a function, 'smoothness 



can be defined more rigorously to mean that / belongs to some function class, 
e.g. that / is in a Sobolev or Besov ball, or that / has bounded variation, 
cf. e.g. (Korostelev and Tsybakov 1993). 

Image denoising techniques like the ones we discuss in Section [2] gener- 
ally require the choice of a smoothing or regularization parameter a which 
determines how much smoothing is to be applied. This parameter might 
be localized, allowing for different amounts of smoothing to be applied to 
different parts of the image; the regularization parameter then becomes a 
function a : [0, l] 2 — > R. Since one might want to smooth more where the 
true image / is smoother, this then enables one to adapt to differing levels of 
smoothness across the image. The reconstruction or denoised image / hence 
depends on that smoothness parameter. The aim of this research was to find 
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a purely data-driven and generally applicable way to choose a. This will be 
illustrated with two specific denoising techniques, namely for linear diffusion 
as well as for TV penalization. We stress, however, that our approach is in 
principle applicable to any regularization technique which depends on prop- 
erly choosing a regularization parameter, the latter possibly being a function 
as described above. 

The main idea can be summarized as follows: consider the residuals = 
Hij — fij] they depend on the smoothness parameter a. Indeed, if we smooth 
too much some structures which were present in / have been smoothed away 
- these structures then will be left in the residuals. Had we found the perfect 
reconstruction, i.e. for f = f, however, the residuals form white noise, see 
([T]). One possible way to decide whether we smoothed too much is therefore 
to check whether the residuals look like white noise - if there is still some 
structure left in them we must have smoothed too much. We note that this 
idea is at the heart of statistical methods for automatically selecting the 
regularization parameter. 

As the key ingredient for choosing the smoothness parameter a we propose 
a statistical multiresolution criterion. Introduced in Section |3j it measures 
deviations from the hypothesis that the residuals are white noise. An impor- 
tant feature of this criterion is that it not only detects if the residuals deviate 
from white noise but also where. This then allows for a locally adaptive choice 
of a. In Section [4] we derive an algorithm for a data-driven selection of a. 
Numerical details and results are given in Section [5j alongside an example 
from confocal microscopy. Finally, we summarize and discuss what we have 
achieved in Section [6] The interested reader might also find the Ph.D. thesis 
of Stichtenoth (2007) useful, on which parts of this article have been based. 

While there exists an extensive literature on localized, data-driven ways 
of choosing the smoothing parameter in one- dimensional function estimation, 
see e.g. (Lepskii et al. 1997), (Gijbels and Mammen 1998) and (Dumbgen 



and Spokoiny 2001), and several articles have been published on multireso- 



lution criteria for one-dimensional settings, cf. the references in Section |3j 
higher-dimensional situations have scarcely been treated. In fact, to the au- 
thors' knowledge ( |Bissantz et"aT]|2006[ |2008[ ) and ( |Davies and Meise||2008[ ) 
are the only published work on a data-driven choice of the smoothing param- 
eter by statistical multiresolution techniques in higher dimensions; however, 



Bissantz et al. (2006, 2008) essentially still apply a one-dimensional version 



of the multiresolution criterion to this end whereas the two-dimensional ap- 
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plication in (Davies and Meise 2008) is described very briefly and somewhat 
rudimentary. 

The present paper therefore appears to be the first to give a comprehen- 
sive and detailed exposition of a two-dimensional mult iresolut ion criterion, 
and also in proposing a method for choosing a localized smoothing param- 
eter in a purely data-driven and general fashion. We claim that it can be 
applied to a wide range of reconstruction techniques depending on a global 
or local regularization parameter; this will be demonstrated exemplarily for 
linear diffusion filtering and total variation regularization below. 

Let us now consider these commonly used methods for image reconstruc- 
tion, i.e. for noise removal. 



2 Image denoising 

Among the best studied denoising technique in image processing is the linear 



diffusion filter, see e.g. (Weickert 1998) or (Guichard et al. 2004), which has 



the physical interpretation of an evolution of the heat equation. For constant 
diffusivity a G R, it is given by 



§- t u a (x,t) = a A x u a (x,t) 
u a (x,0) = y 



(3) 



which is solved by the convolution of the initial data y : R 2 — > R with a 
Gaussian kernel, i.e. 



u a (x*,t) = / K^al(x - x*)y(x)dx 



with 



K h (x) 



cxp 



\x\ 



(4) 



(5) 



2nh r V 2h 2 

it is therefore also called Gaussian filtering. This diffusion filter is very well 
understood by now: it is a low-pass filter effectively reducing white noise; the 
heat equation is the limit of repeated convolving with any kernel fulfilling 
some moment conditions ( Guichard et al.|2004 Ch. 2); it gives rise to a linear 
scale space with many desirable properties (Chaudhuri and Marron 2000); 



and there are many sets of such properties which uniquely determine this 
scale space, see e.g. ( |Witkin|[l983| , ( |Koenderink|[l984| , ( |Alvarez et aLl[l993| ), 
or (Weickert 1998, Table 1.1) for an overview. 



Note that the aim of this paper is not to propagate this specific filter 
but rather to illustrate with this particular example how statistical multires- 
olution analysis can be used to effectively choose the diffusivity, or some 
other regularization parameter, globally or locally - even in the case when 
the optimal choice depends on the smoothness of the true image /; other 
regularization techniques will be discussed at the end of this section. 

In statistics, model 0, where / is not assumed to have a certain low- 
dimensional, parametric form but instead to belong to some non-parametric 
function class, is called a (two-dimensional) nonparametric regression model. 
A standard approach to estimate / from the data y is to use a kernel estimator 
fx' choosing some kernel K : R 2 — > R, one forms a local average where the 
kernel defines the weights, 

for arbitrary G [0, l] 2 . A popular choice for K is the isotropic Gaussian 
kernel Kh with bandwidth h given in We refer to (Wand and Jones 



1995) for a broader treatment of kernel estimation, mentioning just one of 
the many textbooks on the subject. 

From Q we see that, after discretization in space and ignoring boundary 
effects, the solution of the heat equation at time t is the kernel estimator 
with bandwidth h = y/2at, i.e. the role the (global) bandwidth is playing in 
kernel estimation is played by time when smoothing with the heat equation. 
Alternatively, we can vary the diffusivity and fix the endpoint in time since 
Ui(x,t) = u t (x, 1). We will take the latter approach, calling the process 
homogeneous diffusion, with the diffusivity a G R to be specified. Finally, 
we also discretize time to only two time points, namely and 1, thereby 
obtaining a one-step approximation to the heat equation which is computed 
on the same grid as the data are on. For a G R, this defines an estimator 
Aom.diff. through 

/"horn, diflf. ~ U — a ^xfhom.diS. (7) 

where A x is a discretization of the Laplacian in space. 

Clearly, the choice of the diffusivity - or, equivalently, of the bandwidth 
- is critical for the performance of the estimator: a large diffusivity will 
reduce the variance of the estimator but at the expense of increasing its 
bias. While the variance depends on the noise level a, the smoothness of the 
function influences the bias: the smoother the function, the less the bias, with 
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(a) (b) 

Figure 2: Homogeneous diffusion applied to the data in Figure [jjb), with 
diffusivity chosen as a = 40 in (a), and a = 0.1 in (b). 

constant functions giving no bias at all. This is illustrated in Figure [2] while 
a large diffusivity (left) is permissive in regions where the underlying signal, 
see Figure pTa), is smooth, e.g. in the interior of the large circle or within 
the background, sharp edges and smaller features like the circles in the lower 
right or the "valleys" in the lower left are smoothed away; the latter need a 
lower diffusivity (right) but this then results in unnecessary and unwanted 
undersmoothing in the upper left part. The optimal diffusivity strikes a 
balance between bias and variance - a goal which is difficult to achieve, 
given that the bias depends on the smoothness of the unknown function. An 
estimator that chooses the diffusivity in a purely data-driven manner will be 
called adaptive. 

Moreover, instead of choosing a global, one-fits-all, diffusivity a, one 
would rather want to choose it locally, as the function's smoothness is a 
local feature, too, cf. again Figure [2} Estimators that choose the diffusivity 
locally in dependence on the data observed are locally adaptive. We stress 
that in the context of our work "adaptivity" is not meant in the sense that the 
estimator attains certain optimality properties over scales of spaces (see e.g. 
QLepskii|1990"l ), QTsybakov|19~98| and dDtimbgen and Spokoiny|2l)0"T] )), but we 



rather use this term heuristically to express that the estimator chooses e.g. 
its diffusivity in a data-driven fashion, possibly taking local smoothers of / 
into account. Deriving such an estimator is the aim of this paper. It will be 
achieved by the use of the statistical multiresolution criterion which allows to 
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statistically decide whether some reconstruction's residuals still contain parts 
of the signal. We defer to Section [3] for a longer discussion of its analytical 
properties and to Section [4] for its application to image reconstruction. 

The task of choosing the bandwidth locally is similar to selecting a spa- 
tially varying diffusivity a(x) in d3|), and correspondingly in ([7]). We point 
out that, nonetheless, these techniques are not equivalent; this can easily 
be seen if one compares the effect of a zero diffusivity along a curve which 
inhibits the exchange of information across even if the diffusivity is large 
close to that curve. Hence, local diffusivities allow to respect sharp edges. 
Note that in the case of a local diffusivity a : [0, l] 2 — > R, equation ^ can 
still be solved very efficiently as opposed to the computation of kernel esti- 
mators with varying bandwidths, see Section [5] for details. We thus obtain 
an estimator /inhom.diff. through this inhomogeneous diffusion process. The 
corresponding linear operator will be denoted by L a , i.e. 



/inhom.diff. = L a y = a A x L a y + y. 



(8) 



For details on how to solve this equation as well as on differences to local 



bandwidth selection, we refer to (Stichtenoth 2007). 



The variational formulation of the continuous version of (ph is given by 



argmm 



g-y 



a 



(9) 



where || ■ || and | ■ | denote the L2-norm and the Euclidean norm in R 2 , 
respectively. Here, if 2 denotes the Sobolev space of functions in L 2 having 
weak second derivatives also in L 2 , i.e. H 2 = {g : ^i a | <2 Il-D^ll < °°}- 
Clearly, the first term in ^ is a weighted data fit whereas the second term 
is a smoothness penalty; this allows to view a also as local weights, or being 
proportional to a locally varying standard deviation a of the errors. 

Other penalties can also be thought of: considering the poor performance 
of the kernel estimator at sharp edges, one might also be interested in a total 



variation (TV) penalty as introduced by Rudin et al. ( 1992 ) 



TV 



argmm 

geBV 



g-y 



(I 



+ 



IV0I 



(10) 



where the second term symbolically stands for the TV semi-norm, i.e. J \Vg\ 
su P0eci(R"),||0||oo<i/# div ^ where e Cl(R n ) iff G C x (R n ) and <p has 
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compact support. We also assume a discretization in space as above; as 
usual, BV denotes the space of functions with bounded variation, defined as 
BV = {geL 1 : / |V<?| < oo}. 

The main question that needs to be answered is how to choose the dif- 
fusivity (or weights) a. This will be addressed in the following two sections. 
We note that it is not always possible to localize the regularization parame- 
ter. One class of denoising techniques for which this is the case are iterative 
methods; there, the iteration at which to stop the algorithm plays the role 
of the regularization parameter, see e.g. (Bissantz et al. 2006, 2008) for the 
EM algorithm, a.k.a. Richardson-Lucy algorithm, or (Bissantz et al. 2007) 
for a general treatment of iterative regularization schemes. We therefore also 
discuss how to select a globally, although the focus will be on a local selection 
strategy. 



3 A statistical multiresolution criterion 

We have assumed the errors ey to be white noise, see Q. Clearly, considering 
the residuals ry = Uij — fij for some estimator /, we would want them to be as 
close to white noise as possible - if there is any structure left in the residuals 
then the estimator must have missed essential features of the true /. In this 
section, we therefore aim to derive a statistical test for the hypothesis that 
the residuals are white noise, i.e. ry '~ d A/"(0, cr 2 ), against the alternative that 
there is some structure left in them. To this end we define random variables 

for suitable subsets PC [0, l] 2 of the image domain, to be specified later. 
We call up the multiresolution coefficient of P. The collection V n of subsets 
P cannot be too large and will typically be of polynomial order 0(n k ) of the 
number of observations n. This is the case if V n is a Vapnik-Cervonenkis 



(VC) class of subsets, cf. (Pollard 1984; Devroye and Lugosi 2001). In 



fact the family V n must be a VC-class for the procedure to make sense as 
otherwise too many subsets are picked out and the stochastic fluctuation 
of the random variables up can no longer be controlled simultaneously over 
all subsets P G V n . The choice of V n is subtle as it influences the limit 
behaviour; it will be addressed later where we propose amongst others a 
dyadic squares partitioning. 
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Note that up ~ A/"(0, a 2 ) under the hypothesis. If however there is some 
information left in the residuals then the mean of some residuals will no 
longer be 0. Hence the corresponding multiresolution coefficients up are also 
no longer distributed around but around the mean of the signal left in 
the residuals, thus becoming large in absolute value in comparison to the 
expected behaviour of white noise. 

Given an appropriate collection of subsets V n of [0, l] 2 and a,0 < a < 1, 
we can determine (e.g. through simulations) some r n = r n (a) such that 



max — . IJ = < V r n \ogn 2 = a, (12) 

where are standard Gaussian white noise random variables. For any 
function g : [0, l] 2 — > R and P G V n we define 

/ D \ ^XijepiVij ~ 9ij) 

W ^ 9 > p) = — m c pi (13) 

where gij = g(i/n,j/n) and put 



T n = Fn(Vm<J,T n ) = {g : max \ w(g,P)\ < a^Tnhgn 2 }. (14) 



Pev n 



For data generated under ([T]) with = aZij it is seen that JF n is an exact 
universal and non-asymptotic confidence region for any / : [0, l] 2 — > R: 

P(/ 6J„)=« 



provided this event is measurable, see (Davies et al. 2009). The confidence 
region J- n contains many functions which are of little interest, for example, 
all functions which interpolate the data. This may be seen as a severe case 
of overfitting. In general, however, we are interested in simple or, if possible, 
the simplest functions in JF n where simplicity may be defined in terms of 
smoothness, shape or sparsity or some combination of all three. Regulariza- 
tion within JF n leads to optimization problems such as 

minimize TV(g (k) ) subject to g G 7 n for some k = 0,1, . . . (15) 

where TV(g^) denotes some definition of total variation of the function g( k \ 



For examples of this approach for one-dimensional data we refer to (Mammen 



and van de Geer 


1997 


) and ( 


Davies et al. 


2009) 
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In some cases where the optimization problem is algorithmically too dif- 



ficult to be solved we may nevertheless have a sequence /„ 



m 



1,2, 



of 



good candidate functions of increasing complexity. This is the case we con- 
sider in this paper and the strategy consists of choosing the first function f m 
which lies in T n . The success of this strategy depends largely on how good 
the candidate functions are. 

The definition of T n involves a which has to be estimated from the data. 
We propose a robust estimator for this purpose in Section g) see (|2§. We 
note that it would be possible to refine the simple substitution of a n for a 



slightly so that T n becomes an honest (Li 1989 Genovese and Wasserman 



2008), universal and non-asymptotic confidence region for /, i.e. 



P(/ e F n ) > a. 

As already mentioned r n {a) can be obtained by simulations. However, if 
the asymptotic behaviour of r n can be determined then it is often sufficient 
to use Too = lim^oo r n . This is not a simple problem and we consider it in 
more detail below. In the cases we consider we have r ro = 2. 

Following these considerations, we define the statistical multiresolution 
criterion M by 



1 



1 



7? 



= max \up\ 
2 Pev n 



max 



E 



v/2loi^ P&n ^{ Xij ePy 



(16) 



for some collection V n of subsets P, the normalization factor a/2 log n 2 , or 
a/ 2 log n d in a rf-dimensional setting, becoming clear in a short while. Note 
that the choice of V n is subtle as it influences the limit behaviour; it will be 
addressed later where we propose to use a dyadic squares partitioning. 



The multiresolution criterion in the form of ( 16 ) was introduced by Sieg 



mund and Venkatraman (1995) to detect change points; Davies and Kovac 



(2001) were first to use it for one-dimensional non-parametric regression, Bis- 



santz et al. (2006, 2008) applied it to positron emission tomography, while 



a similar criterion has been introduced by Diimbgen and Spokoiny (2001) as 



well as Diimbgen and Walther (2008) in the context of testing qualitative 



hypotheses in non-parametric regression. A major difference between (16) 



and the latter authors' multiscale statistic is that they calibrate the average 
residuals by a term of the order ^J\og(n / §P) in order to enhance medium 
and large scales. In many imaging problems, however, the features on the 
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small scales are most important as they reflect rapid local change at edges. 
This particularly results in a completely different limiting behaviour, com- 
pare Theorem 2 in (Diimbgen and Walther 2008) with Theorem [I] below. 
Siegmund and Worsley (1995) test for the presence of a signal of unknown 



scale and position in arbitrary dimensions, using Gaussian weights in their 
multiresolution criterion; they derive its asymptotic distribution and discuss 
its power. Davies and Meise (2008) used this criterion in one- and two- 



dimenstional settings to determine the weights of smoothing splines. 

Although the cop are identically distributed they are dependent, rendering 
the distribution of M n difficult to obtain analytically. Recently, |Kabluchko 
and Munk (2008) established a.s. convergence, i.e. 



M„ —> a a.s. for n — > oo 



(17) 



for V n the collection of all squares and rectangles; for the one-dimensional 



case and intervals this was shown by Shao (1995). For the latter case, Sieg- 



mund and Venkatraman (1995) proved that M n is asymptotically Gumbel- 



distributed, cf. also (Kabluchko 2007); (Siegmund and Yakir 2000) suggests 



this also to hold in the multi-dimensional case. We will present a similar 
result for the collection of dyadic cubes in Theorem [2] on page 15 It will be 
based on the following, general theorem which is proved in the appendix: 



Theorem 1. For each N eN, let (£} 



(TV) 



■ i SjV J 



be a Gaussian vector with 



standardized marginal distributions. Suppose that for every e > and some 
constant p < 1 not depending on N we have 

#{(», j) E {1, . . . ,iV} 2 : Cov(ef ? 0} = 0(iV 1+£ ) as N -> oo (18) 



and 



Then 



< p provided that i ^ j. 



lim P 



max f 

i=l,...JV 



(N) 



< a N + b N r 



exp(— e 



where a N and b^ are sequences of constants defined by 

-l/21oglogA^-log2 v ^F 



V / 21ogiV + 



V2 log AT 



>N 



V2 log iV' 



(19) 
(20) 

(21) 
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We note that condition (18) states that the covariance matrix of (£,• )»=i jv 
is "sparse", that is, it contains at most 0(N 1+e ) non-zero elements. Condi- 
tion (19) states that the covariance matrix has mutual coherence less than 
1, i.e. its off-diagonal elements are bounded away from ±1. 

While the exact distribution of M n might not be available for more general 
collections V n or in a finite setting, a critical value for testing the hypothesis 
can in principle be obtained through simulation. 

We note that extensions to more general error distributions are possible, 
see phao|[l995| , ( |Kabluchko and Munk|[2008| or QNardi et al.|[2008| . For the 
particular case of Poisson data with not too small intensities, we suggest to 
approximate these by Gaussian distributions: hypothesising that the data y 
stem from intensities /, compute residuals = fZ (yij — fij) which are 
approximately i.i.d. like Gaussian white noise with standard deviation 1, see 
also Section [5l 



4 Data-driven choice of the diffusivity 

We now return to the question of how to choose the diffusivities for the 



estimators defined in Section |2} In view of the test criterion in (16), we 
require that the conditions 

\u P \ < a ^5 logn 2 (22) 



hold for all P 6 V n . The asymptotics in (17) suggest 5 > to be chosen 
close to 2, in such a way that the error of the first kind is below a certain, 
prespecified significance level a, say a = 5%. If all these conditions are 
fulfilled, we cannot reject the hypothesis that the residuals are white noise, 
or equivalently that the reconstruction / is the true function /; thence we 
are to accept / as possibly being the true /. 



An important property of the inequalities (22) to note is that they are 
only one-sided, i.e. they are only violated if the multiresolution coefficients 
up get large in absolute value. This will happen if we oversmooth, using 
too large a diffusivity. If we smooth too little, using too small a diffusivity, 
however, then the residuals get too small and so do the up, and the criterion 
will not be violated. For example, estimating / by / = y leads to all residuals 

being 0, and hence the multiresolution criterion is trivially fulfilled. 

Nonetheless, in many situations the collection V n together with the can- 
didate sets T n constitute a nested sequence of increasing complexity when 
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r n increases in (14); then it is reasonable to choose f n G J- n such that the 



maximum of the left hand side of (22) is as close as possible to equality. 



Bissantz et al. (2006, 2008) have investigated this strategy for stopping the 
EM algorithm in positron emission tomography. The reasoning behind is 
that equality is in fact obtained asymptotically as n — > oo, cf. (17) and 



Theorem [2] Alternatively, one can choose the 'simplest' function according 
to some complexity criterion, e.g. the TV-norm as in (15). 



Global choice. In light of the last remark, our strategy is to determine 
the largest diffusivity - or the smoothest solution - such that the inequalities 
(22) hold. For the homogeneous diffusion process 0, this can be achieved 
algorithmically by starting with a large diffusivity a G R + which leads to 
oversmoothing, i.e. the corresponding estimator /hom.diff. leads to residuals 

Then, the diffusivity is 



ry that violate the multiresolution criterion (22). 



reduced by a prespecified amount until the corresponding estimator gives 
residuals that fulfil the criterion. We refer to Section |5] for more details. 



Davies and Kovac (2001) have shown that such a strategy leads to con- 



sistent estimators when combined with the one-dimensional taut string, i.e. 
when the global smoothing parameter is given by the number of extreme val- 
ues of the taut string estimator / : R — > R; an application to the selection of 



peaks in X-ray diffractograms is given in (Davies et al. 2008). Boysen et al. 



(2007, 2009) proved a similar consistency result for selecting the number of 



jumps m jump regression. 



Dyadic squares partitioning. We yet have to specify what collection 
V n of subsets we will use; for the sake of brevity such collections will be 
called partitionings. Obviously, they do not constitute what is commonly 
referred to as a partition of a set, i.e. a disjoint composition of the latter, 
but they rather comprise several partitions at different scales as we shall 
see. Indeed, we want such a partitioning to consist of subsets that allow to 
detect deviations from the hypothesis at different resolutions, both at coarse 
and fine scales. However, it should not be chosen too rich either. From a 
statistical point of view, taking too many subsets results in too many multiple 
tests being performed which can be seen analytically from the asymptotics 
in (17) breaking down. Also, for each subset P G V n its multiresolution 



coefficient ojp needs to be determined in our iterative procedure, rendering 
a large partitioning V n computationally very expensive. Furthermore it is 



14 



also necessary that the condition for each individual P G V n can be quickly 
checked. We now show how all these conditions can be met. 

The subsequent theory is based on a dyadic squares partitioning, whereas 
for most practical purposes we use more subtle partitionings such as wedgelets 
or curvelets, see below. The dyadic squares partitioning originates from 



(Donoho 1997), cf. also (Kolaczyk et al. 2005) and (Antoniadis et al. 2009) 



for applications to image segmentation and classification, resp. It is obtained 
by splitting the image recursively into four equal subsquares until some pre- 
specified lowest scale is reached, cf. Figure [3] (left). This partitioning covers a 
wide range of scales with comparatively few subsets. At the same time it al- 
lows fast computation of the multiresolution coefficients through cumulative 
sums: define the matrix R of cumulative sums by 



R 



EE 

v fc=i i=i 



Tkl 



From R, the multiresolution coefficient wp of a rectangle P 
can readily be obtained by 



(23) 



x \ji,j 2 ] 



r [ x ij) — Ri2,h ^ii-lj2-"-{»i>i} ^«2,ii-l-"-{ii>l} 



(24) 

As promised in Section |3j we will now give the asymptotic distribution of 
M n in (16) for a dyadic partitioning of a cube K n = {1, . . . , n} d in arbitrary 



dimension d. First, let us recall the following well-known fact, see e.g. (Lead- 



better et al. 1983, Theorem 1.5.3). If {r]i,i G N} are independent standard 
Gaussian random variables, then for every r G R 



lim P 

n— too 



max rji < a n + b n r 

i=l,...,n 



exp(— e 



(25) 



where the normalising sequences a n , b n are defined in (21). As it turns out, 



we obtain the same asymptotic distribution, a Gumbel distribution in the 
case of a dyadic partitioning: 

Theorem 2. // V n is the collection of dyadic subcubes of K n and (ri)i & K n 
are independent standard Gaussian random variables, then for every r G R ; 



lim P[M„ < a%p n + 6) P „r] = exp(-e~ 



(26) 
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Figure 3: Different scales of a dyadic squares partitioning (left), and with a 
line cutting through thereby creating wedgelets (right). 



where 



Mr, = max 



We note that other "dyadic-type" collections of scanning subsets (say, p-adic 
cubes, p — 2,3, . . .) will also lead to this asymptotic distribution, as can be 
seen from the proof in the appendix. 

Local smoothing. If we want to allow the diffusivity to vary locally as for 
the inhomogeneous diffusion introduced in Section |2j we have to modify the 
approach taken for the homogeneous diffusion above. It is the locality of the 
multiresolution criterion that allows for such a modification: if the criterion is 
violated, we can not only conclude that the diffusivity was too high but from 



(22) we can also infer where. Accordingly, each time a violation occurs for a 
multiresolution coefficient ujp of some P G V n , we reduce the local smoothing 
parameter a(x) on that particular subset P only, keeping the diffusivity at 
its current value in the rest of the image. 

As for the homogeneous diffusion, we start by initializing a(x) to a large 
constant. Then we compute the estimator and its residuals, check the mul- 
tiresolution criterion and adapt the smoothing parameter locally. This strat- 



egy has first been introduced by Meise (2004). Obviously, if the hypothesis 



is violated on a small subset P, we can logically conclude that the hypothesis 
must also be violated on any superset P' D P. Note that this is not to say 
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that if the empirical criterion is violated on a subset that it will also be 
violated on all supersets - this is clearly not true in general; however, if we 
decide against the hypothesis on such a subset we cannot accept the hypoth- 
esis on any superset. Hence we start by checking the multiresolution criterion 
on the smallest scale, considering larger sets only if no subset has shown a 
violation yet, thereby avoiding the diffusivity being reduced several times at 
the same spot. We then iterate the process until no further violations of the 



inequalities in (22) are detected, giving our final estimate. 



Note that this results in the diffusivity being piecewise constant on the 
dyadic squares partitioning. In order to be able to adapt it to finer geometric 
features, we enhance the latter by adding so-called wedgelets. These are 
obtained by dividing a dyadic square into two parts by a straight line. We 
draw a certain set of lines through the image domain and add the resulting 
wedgelets to the partitioning. This is illustrated in Figure [3| (right) . For 
a detailed description of wedgelet partitionings we refer to (Donoho 1999) 



and (Friedrich 2005). Fast computation of the multiresolution coefficients of 



wedgelets can be done by a similar though somewhat more involved argument 



as in (24); for details we refer to (Friedrich et al. 2007). 



We use the wedgelet partitioning only to increase the flexibility when 
updating the diffusivity: if a violation is detected on a dyadic square P, all 
wedgelets subdividing that square are considered. If the criterion is fulfilled 
for all wedgelets contained in P the smoothing parameter will be reduced 
on P. Otherwise, the smoothing parameter will be reduced on the wedgelet 
W which yields the largest absolute value of the multiresolution coefficient 
u-\y. Again, all supersets of P will be ignored in that iteration. The final 
algorithm is described in Figure |4| 

For this procedure to be implemented, we need an estimator of the noise's 
standard deviation o. Based on the normality assumption, and taking the 
small number of pixels at sharp edges into account, we use a robust estimator 
based on the median of absolute differences, namely 



1 



a 



2$- 1 (0.75) 



Med {\y id - y^ ld - + Vi-ij-i | 



n 



}, (28) 



where $ denotes the cumulative distribution function of A/"(0, 1), cf. (Davies 



and Kovac 2001). For smooth signals, polynomially weighted, difference- 



based estimators might be more appropriate, see (Munk et al. 2005); note 
that our estimator is indeed unbiased if / is affine. 
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■ choose some 5 > 0, e.g. such that the error of the first kind is 5% 
which can be obtained from simulations 

■ determine a 

■ initialize a : {1, . . . ,n} 2 — > R + to a large constant 
loop 

• obtain current reconstruction /(•) = L a y, see ([8]) 

• compute residuals ry = yij — /ij 

• for each dyadic square P determine its multiresolution coefficient 

^Tmm^;- (27) 



if there is some P on which the criterion is violated, 
i.e. having \up\ > log n 2 , then 
for all such P do 

if there is some subset P' C P with a violation then 

• do nothing 
else 

• determine the multiresolution coefficients uw of all wedgelets 
W comprising this square 

if maxw \oow\ > &\/ 8 logn 2 , i.e. there is some wedgelet with a 
large coefficient, then 

■ reduce a on the wedgelet W = argmax^ uiw 
else 

■ reduce a on P 
end if 

end if 
end for 
else 

• stop and return the current estimate /, i.e. the first estimate 
without a violation 
end if 
end loop 

Figure 4: Algorithm to determine local diffusivity and hence the recon- 
structed image. 
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The algorithms described in this section clearly do not depend on the kind 
of penalty used, cf. (j9|. All that is needed is a global or local smoothness pa- 
rameter a, corresponding to the homogeneous or inhomogeneous diffusivity. 
In particular, we proceed in the same way for the TV regularization in (10). 
See ( Davies and Meise 2008 ) for an application of this approach to one- and 
two-dimensional weighted smoothing splines. 



5 Numerical details and results 

In this section we show simulations where the noiseless image / of is the 
256 x 256 pixel test image shown in Figure [l](a), whose values spread over 
[0, 5]. To this we added Gaussian noise according to Q with (7 = 1, resulting 
in a signal-to-noise ratio of 5, see Figure [jjb). 

In each iteration, the smoothing parameter was reduced where necessary 
through multiplication with a factor A < 1 as described in Section |4} In order 
to speed up the algorithm we chose A according to the size of the multires- 
olution coefficient: the larger up, the smaller A was chosen, taking care not 
too quickly to reduce the diffusivity which would result in undersmoothing. 
Note that the inhomogeneous diffusion estimator can be found efficiently by 
solving ^ since the discretized Laplacian is given by a sparse band matrix, 
rendering Gauss-Seidel iterations an appropriate solution method. 

The algorithms were implemented in Matlab in a modular fashion, thus 
allowing to use any localized regression method as discussed in Section [2} Fig- 
ure [5] shows results for homogeneous and inhomogeneous diffusion. Clearly, 
in order to reconstruct fine details and sharp edges well, such that the mul- 



tiresolution conditions (22) are satisfied, a very small diffusivity (a = 0.934) 
is needed when chosen globally; this results in considerable undersmoothing 
elsewhere in Figure [5]^a). Choosing the diffusivity locally resolves this diffi- 
culty, allowing e.g. to strongly smooth large areas of the background or the 
circle in the upper left while not compromising small level details like the 
dots in the lower right or any sharp edges, see Figures ^b) and[8^a). This 
behaviour of the local diffusivity as an edge-detector is very much apparent 
in Figure [7^a). 

For comparison, we also show results for the total-variation (TV) reg- 



ularization introduced in (10), again both for global and local smoothing 
parameters; see Figure [6^a) and (b), respectively. The difficulty when choos- 
ing a global smoothing parameter is again well visible but the local smoothing 
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Figure 5: Results of homogeneous (a) and inhomogeneous (b) diffusion. 




Figure 6: Results for TV regularization with global (a) and local (b) choice 
of the smoothing parameter. 



parameter appears to have more difficulty adapting to the test object, per- 
mitting considerable undersmoothing on the larger dots in the lower right, cf. 
Figures [7](b) and|8^b). The TV penalty was implemented by approximation 
with a differentiable functional, as described e.g. by Vogel (2002). 



When the signal-to- noise ratio is reduced to 2, see the simulated data in 
Figure [9]^a), the reconstruction gets smoothed more strongly (b), since the 
residuals no longer carry enough statistically significant information to allow 
the oversmoothing of the smaller features to be detected, though still the 
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Figure 7: Local smoothing parameters for diffusion (a) and TV penalty (b). 
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Figure 8: Cut along row 64 indicated in Figure [jja) for diffusion (a) and 
TV penalty (b); solid lines gives the true /, points the data y, dotted lines 
correspond to a global and dashed lines to a local choice of the smoothing 
parameter. 
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only feature not distinguishable from the noise is the smallest of the nine 
circles in the lower right. 

Poisson data with high intensities can also be treated with this method- 
ology as remarked at the end of Section [3] instead of determining a constant 
variance at the beginning of the algorithm, one uses the local variance pre- 
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(a) 



(b) 



Figure 9: (a) Simulated data y with noise level a 
reconstruction using inhomogeneous diffusion. 



2.5. (b) Corresponding 




Figure 10: (a) Simulated Poisson data y with intensities in [50,100]. (b) 
Corresponding reconstruction using inhomogeneous diffusion. 



dieted by the reconstruction to normalize the residuals. We also simulated 
this situation for intensities within [50, 100], i.e. again with a signal-to-noise 



ratio of 5, see Figure [10] The reconstruction demonstrates the applicability 
of our approach also in this situation. 

We conclude this section with an application where the image has been 



obtained by a CCD camera attached to a confocal microscope, see Figure 11 



data courtesy of Emre Togan, Department of Applied Physics, Harvard Uni- 
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versity. The data set shows photoluminescence in a diamond sample, where 
high photoluminescence marks the so-called nitrogene-vacancy centres in the 
diamond. These are of major interest to quantum information science where 
they have been proposed as qubits, i.e. for storage, as they form a solid- 
state system whose spins can be manipulated at room temperature. This 
application therefore aims at removing noise from the image in order to aid 
the researcher in detecting these nitrogene-vacancy centres, such that sub- 
sequent experiments can be conducted on them, see (Dutt et al. 2007) and 
the references therein. We note that our reconstruction reduces the noise 
considerably while keeping small-scale features, much to the satisfaction of 
the physicists involved. 



6 Discussion 

We have demonstrated that the statistical multiresolution criterion allows to 
determine both global and local smoothing parameters in a fully data-driven 
procedure. For inhomogeneous diffusion, it acts as an edge-detector - quite 
as expected: at edges, only very little smoothing is to be permitted for the 
criterion to be fulfilled there, while large regions that are well approximated 
by their first order Taylor series expansion may be smoothed strongly. 

It appears that using a penalty that is more capable of dealing with 
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sharp edges, like TV regularization, does not improve upon the inhomoge- 
neous diffusion, or even performs worse. A possible explanation is that the 
multiresolution criterion itself is able to detect edges, thus eliminating the 
major drawback of the diffusion process - whilst keeping its advantageous 
properties in smooth areas. TV regularization with a global parameter, how- 
ever, already is capable of reconstructing sharp edges and can localize them, 
thus it cannot benefit as much from the multiresolution criterion's ability to 
choose the parameter locally, especially as the smoothness of the reconstruc- 
tion cannot be improved. 

We emphasize once again that the multiresolution analysis can be ex- 
tended to other error distributions as well, see Section |3j Also, general- 
izations to higher dimensions are possible; in particular, three-dimensional 
images or movies could be treated efficiently, too. 
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A Proofs 

The proof of Theorem [T] on page 12 will be based on the following lemma 



which is known as Berman's Inequality, see e.g. (Leadbetter et al. 1983 
Theorem 4.2.1). 

Lemma 1. Let . . . , £jv) be a Gaussian vector with standard margins and 
covariance matrix (pij)ij=i,...,N , o^nd let rji, ... ,rj N be independent standard 
Gaussian variables. Then, for every u > 0, 



max £; < u 

Ki<N 



max 7?j < u 

Ki<N 



<- £ 



exp 



u 



1 + Pij 



l<i<j<N \ 1 1 

Proof of Theorem [l] on page 12, Let ujy = a N + 6jyr, with a N and bjy 
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as in (21). We then have, using Berman's Inequality, 



max Q < u N 

Ki<N ~ 



max rt: <um 

Ki<N 



<o(i) Yl icov(^,er)i«p(- 

l<i<j<N - 



U 



N 



1+P 



< 0(iV 1+e )exp 



u 



N 



1 + P 



as iV — > oo. Noting that ~ v^TogiV and choosing e small enough, we 
see that the right-hand side is o(l) as iV -)■ oo. Combining this estimate 
with (25), we obtain (20), which finishes the proof of the theorem. □ 



Proof of Theorem [2] on page |15j . We are going to apply Theorem [T] with 
N = %P n to the random vector (£p ) pev n , where 



AN) 



PeV n . 



First we show that QlQp holds. Let P 1 and P 2 be two different dyadic 

0, so 



subcubes of K n . If P 1 fl P 2 = 0, then we have Cov(£p , £ ft 



(TV) JNh 



that (19) is fulfilled. If P 1 nP 2 ^ 0, then we have either P x C P 2 or P 2 C P x . 



Assuming that, say, Pi C P 2 , we obtain (due to the dyadic structure) that 
ttPi < tt^/2, and hence, 



cov($°,$; 



H(PinP 2 ) 



/ K < 1 



W2~ V2' 



Now we show that ( 18 ) holds. A dyadic cube with side length 2 k contains 
2 d ( fc_i ) dyadic cubes with side length 2 l , where < i < k. Therefore, the 
total number of dyadic cubes which are contained in a dyadic cube with side 
length 2 h is YliZo 2 d ^~^ < 2 d( - k+1 \ Further, the number of dyadic cubes in 
V n having side length 2 k is \n/2 k \ d . Note also that, trivially, %P n > n d . So, 



we may estimate the cardinality on the left-hand side of (18) from above by 



[log n/ log 2 J 

2 l n / 2k \ d ■ 2<fc+1) = 0(n d \ogn) = 0{$V l n +£ ) 



k=0 



as N — > 00 and for every e > 0. This finishes the proof. 



□ 
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